05. Clustering with PyCaret

Geospatial Analysis of the 2023 Earthquakes in Turkey
Master Thesis
Master of Data Science

Gozde Yazganoglu (gozde.yazganoglu@cunef.edu)

Table of Contents

  1. Introduction
  2. Importing libraries
  3. Reading the data and Setup
  4. Model Selection
  5. Visualization and Prediction of the Model
  6. Model interpretation

1. Introduction

In this notebook, just as in the previous one, we use PyCaret, this time to explore clustering models.

In the geographic notebooks we observed some clustering, but only in terms of geographic location. Here we look at the dataset as a whole to see which other variables are associated with these clusters.

2. Importing libraries

This notebook uses the new_pycaret.yaml environment. To run it locally, install this environment first.

We use only the basic pandas library to avoid problems with the environment. However, the data retains geographic information such as latitude, longitude, spatial lags and distances. If clusters exist, what are they and where are they?

In [1]:
#importing libraries from new_pycaret environment
from pycaret.clustering import *

import pandas as pd
import pickle
import numpy as np

3. Reading the Data, Setup and Modeling

Clustering in PyCaret:

  1. Setup:

Just like other modules in PyCaret, we begin with the setup function, where we preprocess and set up the data for clustering.

  2. Model Creation:

We create a clustering model with the create_model function. For example, create_model('kmeans') creates a K-Means clustering model.

  3. Model Visualization:

We can visualize cluster results using various plots, like the Elbow plot, Silhouette plot, etc.

  4. Assigning Labels:

Once we've chosen the best number of clusters, we can assign the data points to the respective clusters.
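
The four steps above can be sketched as follows. This is a minimal sketch, not the exact cells of this notebook: the helper name run_kmeans_workflow and the arguments normalize=True, session_id=123 and k=4 are illustrative assumptions, while setup, create_model, plot_model and assign_model are the PyCaret functions named in the text.

```python
def run_kmeans_workflow(df, k=4):
    """Hypothetical helper: runs the four PyCaret clustering steps on a DataFrame."""
    # imported here so the helper can be defined before PyCaret is installed
    from pycaret.clustering import setup, create_model, plot_model, assign_model

    # 1. Setup: preprocessing; normalize=True (z-score scaling) is an assumption
    setup(df, normalize=True, session_id=123)
    # 2. Model creation: K-Means with k clusters
    model = create_model('kmeans', num_clusters=k)
    # 3. Model visualization: Elbow plot
    plot_model(model, plot='elbow')
    # 4. Assigning labels: returns the data with an extra 'Cluster' column
    return assign_model(model)
```

With the environment installed, it could be called as run_kmeans_workflow(data) once the DataFrame has been read.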

Advantages of Clustering in PyCaret:

Simplicity and Efficiency: 

PyCaret's clustering module allows users to quickly set up and execute clustering algorithms with minimal code.

Integrated Visualization:

The library comes with built-in visualization tools that make it easy to analyze and interpret clustering results.

Variety of Algorithms: 

PyCaret provides a range of clustering algorithms including K-Means, Agglomerative, DBSCAN, and more.

Preprocessing Included: 

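PyCaret's setup function handles many preprocessing tasks, such as imputation and encoding, automatically, and can scale features when normalize=True is set (scaling is not applied by default). This is essential for clustering since algorithms like K-Means are sensitive to feature scales.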

Flexibility: 

We can customize the preprocessing pipeline or use external models and tools if needed.
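
To illustrate why feature scale matters for distance-based clustering, here is a small sketch using NumPy. The four rows are made-up values that only mimic the magnitudes of a population column and a water_access column; the helper population_share is hypothetical. It shows how much of the squared Euclidean distance between two rows comes from the large-scale column, before and after z-score standardization.

```python
import numpy as np

# Hypothetical rows: [population, water_access]; values are illustrative only
X = np.array([
    [78976.0, 0.95],
    [1394911.0, 0.90],
    [334103.0, 0.60],
    [181399.0, 0.99],
])

def population_share(a, b):
    """Fraction of the squared Euclidean distance contributed by column 0."""
    sq = (a - b) ** 2
    return sq[0] / sq.sum()

# On raw values the population column dominates the distance almost entirely
raw_share = population_share(X[0], X[2])

# z-score standardization: zero mean and unit variance per column
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_share = population_share(Z[0], Z[2])

print(f"raw: {raw_share:.6f}, scaled: {scaled_share:.6f}")
```

After standardization both columns can contribute to the distance, so a cluster assignment no longer depends on population alone.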

In [2]:
#reading pandas dataframe
data = pd.read_csv('../data/processed/df.csv')
In [3]:
data.columns
Out[3]:
Index(['obj_type', 'info', 'damage_gra', 'locality', 'population', 'income',
       'total_sales', 'second_sales', 'water_access', 'elec_cons',
       'building_perm', 'land_permited', 'labour_fource', 'unemployment',
       'agricultural', 'life_time', 'hb_per100000', 'fertility', 'hh_size',
       'longitude', 'latitude', 'nearest_water_source_distance',
       'nearest_camping_distance', 'nearest_earthquake_distance',
       'nearest_fault_distance', 'elev', 'percentage', 'damaged_percentage',
       'destroyed_percentage', 'spatial_lag', 'lag_percentage',
       'std_percentage', 'std_lag_percentage', 'lag_damaged_percentage',
       'std_damaged_percentage', 'std_lag_damaged_percentage',
       'lag_destroyed_percentage', 'std_destroyed_percentage',
       'std_lag_destroyed_percentage', 'lag_nearest_water_source_distance',
       'std_nearest_water_source_distance',
       'std_lag_nearest_water_source_distance', 'lag_nearest_camping_distance',
       'std_nearest_camping_distance', 'std_lag_nearest_camping_distance',
       'lag_nearest_earthquake_distance', 'std_nearest_earthquake_distance',
       'std_lag_nearest_earthquake_distance', 'lag_nearest_fault_distance',
       'std_nearest_fault_distance', 'std_lag_nearest_fault_distance',
       'lag_damage_gra', 'std_damage_gra', 'std_lag_damage_gra'],
      dtype='object')

We excluded the category damage_gra = 0, as it denotes buildings with an undetermined status.

Additionally, we omitted 'percentage', 'std_percentage', 'lag_percentage', 'std_damage_gra' and 'std_lag_damage_gra', since they correlate closely with the damage_gra value.

In [4]:
# Removing the 0 damage grade (undetermined status)
data = data[data['damage_gra'] != 0]

# Removing columns that are directly correlated with the damage grade
data = data.drop(columns=['percentage', 'std_damage_gra', 'std_percentage',
                          'std_lag_damage_gra', 'lag_percentage'])

data.tail()
Out[4]:
obj_type info damage_gra locality population income total_sales second_sales water_access elec_cons ... lag_nearest_camping_distance std_nearest_camping_distance std_lag_nearest_camping_distance lag_nearest_earthquake_distance std_nearest_earthquake_distance std_lag_nearest_earthquake_distance lag_nearest_fault_distance std_nearest_fault_distance std_lag_nearest_fault_distance lag_damage_gra
98790 212_RAILWAYS 997_NOT_APPLICABLE 1 TURKOGLU 78976 5997 1938 536 0.95 4343 ... 0.012445 -0.364072 -0.364657 0.092105 -1.097597 -1.097274 0.031475 -1.551232 -1.552713 1.0
98791 212_RAILWAYS 997_NOT_APPLICABLE 1 TURKOGLU 78976 5997 1938 536 0.95 4343 ... 0.012365 -0.362477 -0.364976 0.092183 -1.098330 -1.097127 0.031408 -1.547625 -1.553434 1.0
98792 212_RAILWAYS 997_NOT_APPLICABLE 1 TURKOGLU 78976 5997 1938 536 0.95 4343 ... 0.012396 -0.363093 -0.364853 0.092170 -1.098204 -1.097152 0.031503 -1.552762 -1.552407 1.0
98793 212_RAILWAYS 997_NOT_APPLICABLE 1 TURKOGLU 78976 5997 1938 536 0.95 4343 ... 0.012494 -0.365055 -0.364460 0.092054 -1.097125 -1.097368 0.031509 -1.553062 -1.552347 1.0
98794 212_RAILWAYS 997_NOT_APPLICABLE 1 TURKOGLU 78976 5997 1938 536 0.95 4343 ... 0.012537 -0.365915 -0.364288 0.092011 -1.096724 -1.097448 0.031543 -1.554889 -1.551981 1.0

5 rows × 49 columns

Unlike with the classification object, in clustering the standard setup function is our only option.

K-Means and DBSCAN are popular clustering algorithms, but they operate on different principles and are suited to different types of data and use cases. Here's a comparison of the two:

K-Means:

Method:

Partitional clustering method. Iteratively assigns points to clusters by minimizing the sum of squared distances from points to their assigned cluster centers.

Number of Clusters:

Must be specified a priori. The choice of the number of clusters (k) can be guided by methods like the Elbow method, Silhouette score, etc.

Shape of Clusters:

Assumes that clusters are spherical and equally sized. Can struggle with non-spherical clusters or clusters of different densities.

Noise & Outliers:

Sensitive to noise and outliers, which can heavily influence the position of cluster centroids.

Initialization:

Depends on the initial placement of centroids. Common methods include random initialization and the K-Means++ initialization. Multiple runs with different initializations might be needed due to the possibility of convergence to local optima.

Scalability:

Relatively scalable, but can be computationally intensive for a very large number of data points. Variants such as MiniBatch K-Means can help in those cases.
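
The K-Means procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration of Lloyd's algorithm with random initialization, not the scikit-learn implementation that PyCaret uses; the function name kmeans and the synthetic two-blob data are assumptions for the example. It also computes the inertia (sum of squared distances) for several values of k, which is the quantity the Elbow method inspects.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: returns (labels, centers, inertia)."""
    rng = np.random.default_rng(seed)
    # random initialization: pick k distinct data points as starting centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: nearest center by squared Euclidean distance
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: move each center to the mean of its points
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # final assignment and inertia with respect to the last centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = d2.min(axis=1).sum()
    return labels, centers, inertia

# Toy data: two well-separated blobs (synthetic, not the thesis data)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# Inertia for several k, as inspected by the Elbow method
inertias = {k: kmeans(X, k)[2] for k in (1, 2, 3, 4)}
labels, centers, _ = kmeans(X, 2)
print(inertias)
```

On data like this the inertia drops sharply from k = 1 to k = 2 and much less afterwards, which is the "elbow" one looks for.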

DBSCAN (Density-Based Spatial Clustering of Applications with Noise):

Method:

Density-based clustering method. Groups together points that are closely packed together in the data space, marking low-density regions as outliers.

Number of Clusters:

Does not require the number of clusters to be specified in advance. Automatically determines clusters based on data density.

Shape of Clusters:

Can find arbitrarily shaped clusters. Works well with clusters of similar density.

Noise & Outliers:

Explicitly handles noise and outliers by classifying them as points not belonging to any cluster.

Initialization:

Does not depend on initialization as K-Means does.

Parameters:

Requires specification of two main parameters: eps (defines the radius around a data point to look for neighbors) and min_samples (the minimum number of points needed to form a dense region). The choice of these parameters can significantly affect clustering results, and they might not be intuitive to set.

Scalability:

Less scalable for very large datasets as it requires distance computation between points. However, optimized versions and approximations exist to make it more scalable.
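
The density-based idea can be sketched in the same way. The following is a minimal, unoptimized illustration of DBSCAN in NumPy, not the scikit-learn implementation used by PyCaret; the function name dbscan, the toy data and the parameter values eps=1.5 and min_samples=5 are assumptions chosen for the example. As in scikit-learn, a point counts itself among its neighbors.

```python
import numpy as np

def dbscan(X, eps, min_samples):
    """Minimal DBSCAN; returns labels where -1 marks noise."""
    n = len(X)
    # pairwise distances: O(n^2) memory, suitable for a toy example only
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])
    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # grow a new cluster from an unassigned core point
        labels[i] = cluster_id
        stack = [i]
        while stack:
            p = stack.pop()
            if not is_core[p]:
                continue  # border points join a cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id
                    stack.append(q)
        cluster_id += 1
    return labels

# Toy data: two dense blobs plus one isolated point (synthetic)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.3, (40, 2)),
    rng.normal(5, 0.3, (40, 2)),
    [[20.0, 20.0]],  # far from both blobs
])
labels = dbscan(X, eps=1.5, min_samples=5)
print(np.unique(labels))
```

Note that the number of clusters is never passed in: it follows from eps and min_samples, and the isolated point is left with the noise label -1.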

In [8]:
cluster = setup(data)
kmeans = create_model('kmeans')
  Description Value
0 Session id 6993
1 Original data shape (98272, 49)
2 Transformed data shape (98272, 120)
3 Numeric features 46
4 Categorical features 3
5 Preprocess True
6 Imputation type simple
7 Numeric imputation mean
8 Categorical imputation mode
9 Maximum one-hot encoding -1
10 Encoding method None
11 CPU Jobs -1
12 Use GPU False
13 Log Experiment False
14 Experiment Name cluster-default-name
15 USI 8f7c
  Silhouette Calinski-Harabasz Davies-Bouldin Homogeneity Rand Index Completeness
0 0.7315 645515079.2386 0.2921 0 0 0
In [9]:
list_kmeans = ['cluster','tsne','elbow','silhouette']

for plot in list_kmeans:
    plot_model(kmeans, plot = plot)
    
[Figures: K-Means plots produced by plot_model]
In [10]:
dbscan = create_model('dbscan')
  Silhouette Calinski-Harabasz Davies-Bouldin Homogeneity Rand Index Completeness
0 0.4529 82.9497 1.4016 0 0 0
In [11]:
list_dbscan = ['cluster', 'tsne', 'distance', 'silhouette', 'distribution']

for plot in list_dbscan:
    try:
        plot_model(dbscan, plot=plot)
    except TypeError as err:
        # e.g. 'distance' needs cluster_centers_, which DBSCAN does not have
        print(f"Skipping '{plot}' plot: {err}")

Without the try/except, the loop stops at the 'distance' plot. The traceback, condensed to its three chained exceptions, is:

AttributeError: 'DBSCAN' object has no attribute 'cluster_centers_'
YellowbrickAttributeError: neither visualizer 'InterclusterDistance' nor wrapped estimator 'DBSCAN' have attribute 'cluster_centers_'
TypeError: Plot Type not supported for this model.

[Figure: DBSCAN plot produced by plot_model]

4. Model Selection:

K-Means has the better Silhouette (0.7315 vs. 0.4529, higher is better), Calinski-Harabasz (645515079.2386 vs. 82.9497, higher is better) and Davies-Bouldin (0.2921 vs. 1.4016, lower is better) scores, suggesting more distinct and better-separated clusters than DBSCAN. Homogeneity, Rand Index and Completeness are 0 for both algorithms. These metrics compare the clusters with ground-truth class labels, and no such labels were passed to create_model, so the zeros most likely reflect the missing labels rather than poor agreement with them.

Given these scores, K-Means seems to be the better clustering model in terms of defining distinct clusters. Next, we profile the clusters it produces.
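
For readers unfamiliar with the Silhouette score used in this comparison, the sketch below computes it from its definition with NumPy. For each point, a is the mean distance to the other points of its own cluster, b is the smallest mean distance to any other cluster, and the point's score is (b - a) / max(a, b). The function name mean_silhouette and the synthetic data are assumptions for illustration; PyCaret reports the scikit-learn value, not this function.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Mean silhouette coefficient computed directly from the definition."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # convention: a singleton cluster scores 0
        # a: mean distance to the other members of the same cluster
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Synthetic data: two compact, well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])

good_labels = np.repeat([0, 1], 30)   # labels that match the blobs
bad_labels = np.arange(60) % 2        # labels that cut across both blobs

good = mean_silhouette(X, good_labels)
bad = mean_silhouette(X, bad_labels)
print(f"matching labels: {good:.3f}, mixed labels: {bad:.3f}")
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 indicate overlapping ones, which is why the higher K-Means score is read as the better result.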

In [12]:
# creating a dataframe with kmeans clusters
kmeans_df = assign_model(kmeans)
In [15]:
kmeans_df.tail()
Out[15]:
obj_type info damage_gra locality population income total_sales second_sales water_access elec_cons ... std_nearest_camping_distance std_lag_nearest_camping_distance lag_nearest_earthquake_distance std_nearest_earthquake_distance std_lag_nearest_earthquake_distance lag_nearest_fault_distance std_nearest_fault_distance std_lag_nearest_fault_distance lag_damage_gra Cluster
98790 212_RAILWAYS 997_NOT_APPLICABLE 1 TURKOGLU 78976 5997 1938 536 0.95 4343 ... -0.364072 -0.364657 0.092105 -1.097597 -1.097274 0.031475 -1.551232 -1.552713 1.0 Cluster 0
98791 212_RAILWAYS 997_NOT_APPLICABLE 1 TURKOGLU 78976 5997 1938 536 0.95 4343 ... -0.362477 -0.364976 0.092183 -1.098330 -1.097127 0.031408 -1.547625 -1.553434 1.0 Cluster 0
98792 212_RAILWAYS 997_NOT_APPLICABLE 1 TURKOGLU 78976 5997 1938 536 0.95 4343 ... -0.363093 -0.364853 0.092170 -1.098204 -1.097152 0.031503 -1.552762 -1.552407 1.0 Cluster 0
98793 212_RAILWAYS 997_NOT_APPLICABLE 1 TURKOGLU 78976 5997 1938 536 0.95 4343 ... -0.365055 -0.364460 0.092054 -1.097125 -1.097368 0.031509 -1.553062 -1.552347 1.0 Cluster 0
98794 212_RAILWAYS 997_NOT_APPLICABLE 1 TURKOGLU 78976 5997 1938 536 0.95 4343 ... -0.365915 -0.364288 0.092011 -1.096724 -1.097448 0.031543 -1.554889 -1.551981 1.0 Cluster 0

5 rows × 50 columns

In [17]:
kmeans_df.groupby('Cluster').describe()
Out[17]:
damage_gra population ... std_lag_nearest_fault_distance lag_damage_gra
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
Cluster
Cluster 0 50971.0 1.091483 0.439657 1.0 1.0 1.0 1.0 4.0 50971.0 1.394911e+06 ... 1.025527 1.741478 50971.0 1.095019 0.333754 0.0 1.0 1.0 1.0 4.0
Cluster 1 10694.0 1.384982 0.849468 1.0 1.0 1.0 1.0 4.0 10694.0 3.341033e+05 ... -1.268409 -1.019502 10694.0 1.401590 0.717199 1.0 1.0 1.0 1.4 4.0
Cluster 2 11941.0 1.000586 0.030347 1.0 1.0 1.0 1.0 3.0 11941.0 2.170110e+06 ... 1.016423 1.496670 11941.0 0.999933 0.025229 0.0 1.0 1.0 1.0 1.6
Cluster 3 24666.0 1.061137 0.349672 1.0 1.0 1.0 1.0 4.0 24666.0 1.813991e+05 ... -0.385643 -0.211827 24666.0 1.060148 0.231372 0.4 1.0 1.0 1.0 4.0

4 rows × 368 columns

5. Visualization and Prediction of the Model:

Thanks to PyCaret, we were able to produce several plots above with a few lines of code. Now I would like to add some other visualizations.

In [21]:
#tabular distribution for categorical variables.


print(pd.crosstab(kmeans_df['Cluster'], kmeans_df['obj_type']))
print(pd.crosstab(kmeans_df['Cluster'], kmeans_df['locality']))
print(pd.crosstab(kmeans_df['Cluster'], kmeans_df['info']))
obj_type   11_RESIDENTIAL_BUILDINGS  12_NON_RESIDENTIAL_BUILDINGS  \
Cluster                                                             
Cluster 0                      9457                          1587   
Cluster 1                      5124                           123   
Cluster 2                      5953                           114   
Cluster 3                     15395                           456   

obj_type   211_HIGHWAYS__STREETS_AND_ROADS  212_RAILWAYS  213_AIRFIELD  \
Cluster                                                                  
Cluster 0                            29199           315            10   
Cluster 1                             5447             0             0   
Cluster 2                             5552             0             5   
Cluster 3                             8371            18             0   

obj_type   214_BRIDGES__ELEVATED_HIGHWAYS__TUNNELS_AND_SUBWAYS  \
Cluster                                                          
Cluster 0                                                  8     
Cluster 1                                                  0     
Cluster 2                                                  0     
Cluster 3                                                  7     

obj_type   22_PIPELINES__COMMUNICATION_AND_ELECTRICITY_LINES  \
Cluster                                                        
Cluster 0                                                  7   
Cluster 1                                                  0   
Cluster 2                                                  0   
Cluster 3                                                  8   

obj_type   23_COMPLEX_CONSTRUCTIONS_ON_INDUSTRIAL_SITES  \
Cluster                                                   
Cluster 0                                             1   
Cluster 1                                             0   
Cluster 2                                             2   
Cluster 3                                             0   

obj_type   24_OTHER_CIVIL_ENGINEERING_WORKS  995_UNCLASSIFIED  
Cluster                                                        
Cluster 0                               229             10158  
Cluster 1                                 0                 0  
Cluster 2                               315                 0  
Cluster 3                                14               397  
locality   ADIYAMAN  AFSIN  ANTAKYA  BAHCE  DIYARBAKIR  DUZICI  ELBISTAN  \
Cluster                                                                    
Cluster 0         0    712        0      0         890       0      1688   
Cluster 1         0      0     8196      0           0       0         0   
Cluster 2         0      0        0      0           0       0         0   
Cluster 3      6772      0        0   3632           0    7995         0   

locality   ERDEMOÄ_LU  GAZIANTEP  GOLBASI  ISLAHIYE  KAHRAMANMARAS  KIRIKHAN  \
Cluster                                                                        
Cluster 0           0      24025        0      1404          12584         0   
Cluster 1           0          0        0         0              0      2498   
Cluster 2           0          0        0         0              0         0   
Cluster 3         580          0      429         0              0         0   

locality   MALATYA  NURDAGI  OSMANIYE  PAZARCIK  SANLIURFA  TURKOGLU  
Cluster                                                               
Cluster 0     8269      817         0        97          0       485  
Cluster 1        0        0         0         0          0         0  
Cluster 2        0        0         0         0      11941         0  
Cluster 3        0        0      5258         0          0         0  
info       1211_HOTEL_BUILDINGS  1220_ADMINISTRATIVE  1221_INSTITUTIONAL  \
Cluster                                                                    
Cluster 0                    20                    4                  27   
Cluster 1                     0                    0                   0   
Cluster 2                     0                    0                   0   
Cluster 3                     1                    0                   1   

info       1222_POLICE_STATION  1223_FIRE_STATION  122_OFFICE_BUILDINGS  \
Cluster                                                                   
Cluster 0                   13                  6                    11   
Cluster 1                    0                  0                     2   
Cluster 2                    0                  0                     0   
Cluster 3                    1                  0                     1   

info       123_WHOLESALE_AND_RETAIL_TRADE_BUILDINGS  \
Cluster                                               
Cluster 0                                       101   
Cluster 1                                         0   
Cluster 2                                         0   
Cluster 3                                         0   

info       1241_COMMUNICATION_BUILDINGS__STATIONS__TERMINALS_AND_ASSOCIATED_BUILDINGS  \
Cluster                                                                                 
Cluster 0                                                  4                            
Cluster 1                                                  0                            
Cluster 2                                                  0                            
Cluster 3                                                  1                            

info       1251_INDUSTRIAL_BUILDINGS  1252_RESERVOIRS__SILOS_AND_WAREHOUSES  \
Cluster                                                                       
Cluster 0                        108                                      1   
Cluster 1                         53                                     31   
Cluster 2                          0                                      0   
Cluster 3                        280                                      0   

info       ...  2141_BRIDGES_AND_ELEVATED_HIGHWAYS  2142_TUNNELS_AND_SUBWAYS  \
Cluster    ...                                                                 
Cluster 0  ...                                   8                         0   
Cluster 1  ...                                   0                         0   
Cluster 2  ...                                   0                         0   
Cluster 3  ...                                   6                         1   

info       2214_LONG_DISTANCE_ELECTRICITY_LINES  \
Cluster                                           
Cluster 0                                     7   
Cluster 1                                     0   
Cluster 2                                     0   
Cluster 3                                     5   

info       221_LONG_DISTANCE_PIPELINES__COMMUNICATION_AND_ELECTRICITY_LINES  \
Cluster                                                                       
Cluster 0                                                  0                  
Cluster 1                                                  0                  
Cluster 2                                                  0                  
Cluster 3                                                  1                  

info       2224_LOCAL_ELECTRICITY_AND_TELECOMMUNICATION_CABLES  \
Cluster                                                          
Cluster 0                                                  0     
Cluster 1                                                  0     
Cluster 2                                                  0     
Cluster 3                                                  2     

info       2301_CONSTRUCTIONS_FOR_MINING_OR_EXTRACTION  \
Cluster                                                  
Cluster 0                                            0   
Cluster 1                                            0   
Cluster 2                                            1   
Cluster 3                                            0   

info       2302_POWER_PLANT_CONSTRUCTIONS  2411_SPORTS_GROUNDS  \
Cluster                                                          
Cluster 0                               1                   39   
Cluster 1                               0                    0   
Cluster 2                               1                   32   
Cluster 3                               0                    3   

info       2412_OTHER_SPORT_AND_RECREATION_CONSTRUCTIONS  997_NOT_APPLICABLE  
Cluster                                                                       
Cluster 0                                            190               20024  
Cluster 1                                              0                5311  
Cluster 2                                            283                6261  
Cluster 3                                             11               15830  

[4 rows x 45 columns]
In [25]:
kmeans_df_map = kmeans_df[['latitude', 'longitude', 'Cluster']]
In [26]:
import matplotlib.pyplot as plt

# Create a scatter plot
plt.figure(figsize=(10, 8))
for cluster in sorted(kmeans_df_map['Cluster'].unique()):
    subset = kmeans_df_map[kmeans_df_map['Cluster'] == cluster]
    # values already read 'Cluster 0', 'Cluster 1', ..., so use them as labels directly
    plt.scatter(subset['longitude'], subset['latitude'], label=cluster)

plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Clusters based on Latitude and Longitude')
plt.legend()
plt.grid(True)
plt.show()
[Figure: scatter plot "Clusters based on Latitude and Longitude" (x-axis: Longitude, y-axis: Latitude)]
In [30]:
kmeans_damage_gra = kmeans_df[['damage_gra', 'Cluster']]
kmeans_damage_gra.groupby('damage_gra').describe()
Out[30]:
Cluster
count unique top freq
damage_gra
1 92806 4 Cluster 0 48446
2 2010 4 Cluster 0 1057
3 2083 4 Cluster 1 945
4 1373 3 Cluster 0 670
In [32]:
# saving the clustered data to build more visualizations later

kmeans_df.to_csv('../data/processed/kmeansdf.csv')